Faster pattern matching with character classes using prime number encoding

Authors

  • Chaim Linhart
  • Ron Shamir
Abstract

In pattern matching with character classes, the goal is to find all occurrences of a pattern of length m in a text of length n, where each pattern position consists of an allowed set of characters from a finite alphabet Σ. We present an FFT-based algorithm that uses a novel prime-number encoding scheme and is log n / log m times faster than the fastest extant approaches, which are based on boolean convolutions. In particular, if m^{|Σ|} = n^{O(1)}, our algorithm runs in time O(n log m), matching the complexity of the fastest techniques for wildcard matching, a special case of our problem. A major advantage of our algorithm is that it allows a tradeoff between the running time and the RAM word size. Our algorithm also speeds up solutions to approximate matching problems with character classes, namely matching with k mismatches and Hamming distance, as well as to the subset matching problem.
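For orientation, the boolean-convolution baseline that the abstract compares against can be sketched as follows. This is not the prime-number encoding itself, whose details the abstract does not give. It is a minimal standard-library sketch that assumes one FFT correlation per text character, and the names `fft`, `correlate`, `mismatch_counts`, and `find_matches` are hypothetical. For each character, it correlates "text position holds this character" with "pattern position forbids this character", and the sum over characters is the number of mismatches at each alignment.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def correlate(x, y):
    """Return c[i] = sum_j x[i + j] * y[j] for every full alignment i."""
    m = len(y)
    size = 1
    while size < len(x) + m:
        size *= 2
    fx = fft(list(x) + [0] * (size - len(x)))
    # Reversing y turns the convolution computed by the FFT into a correlation.
    fy = fft(list(reversed(y)) + [0] * (size - m))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    # Alignment i sits at index i + m - 1; divide by size to normalise.
    return [round(prod[i + m - 1].real / size) for i in range(len(x) - m + 1)]


def mismatch_counts(text, pattern):
    """pattern is a list of sets of allowed characters, one set per position."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    counts = [0] * (n - m + 1)
    for ch in set(text):  # only characters that occur in the text can mismatch
        t_ind = [1 if c == ch else 0 for c in text]
        p_forbid = [0 if ch in allowed else 1 for allowed in pattern]
        for i, v in enumerate(correlate(t_ind, p_forbid)):
            counts[i] += v
    return counts


def find_matches(text, pattern, k=0):
    """Start positions where the pattern matches with at most k mismatches."""
    return [i for i, d in enumerate(mismatch_counts(text, pattern)) if d <= k]


print(find_matches("abcabd", [{"a"}, {"b"}, {"c", "d"}]))  # [0, 3]
```

A wildcard position is simply a set containing the whole alphabet. Setting `k` above zero gives matching with k mismatches, and `mismatch_counts` returns the Hamming distance at every alignment. Because this sketch transforms the whole text at once, each correlation costs O(n log n); the O(n log m) per-convolution bound usually quoted for this approach additionally needs the text split into overlapping blocks of length about 2m, which is omitted here.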

Similar resources

Fast and Simple Character Classes and Bounded Gaps Pattern Matching, with Applications to Protein Searching

The problem of fast exact and approximate searching for a pattern that contains classes of characters and bounded size gaps (CBG) in a text has a wide range of applications, among which a very important one is protein pattern matching (for instance, one PROSITE protein site is associated with the CBG [RK] - x(2,3) - [DE] - x(2,3) - Y, where the brackets match any of the letters inside, and x(2,...

Matching for Run-Length Encoded Strings

Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y i...

Fast search in DNA sequence databases using punctuation and indexing

Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences...

Recognition of an Indian Script Using Multilayer Perceptrons and Fuzzy Features

We present a multi-stage character recognition system for an Indian script, namely Bengali (also called Bangla), using fuzzy features and multilayer perceptrons (MLP). The fuzzy features are extracted from the Hough transform of the character pattern pixels. We first define a number of fuzzy sets on the Hough transform accumulator cells. The fuzzy sets are then combined by t-norms to generate feature...

Vehicle License Plate Recognition System

A vehicle license plate recognition system makes vehicle monitoring in automatic zone access control more efficient. Such a system avoids the need for special tags, since every vehicle carries a unique registration number plate. A number of techniques have been used for recognizing car plate characters. This system uses neural network character recognition and pattern matching of ...


Journal:
  • J. Comput. Syst. Sci.

Volume 75

Publication date: 2009